arXiv:1507.03194v2 [stat.ML] 28 Aug 2015 


A Review of Nonnegative Matrix Factorization 
Methods for Clustering 


Ali Caner Turkmen 

Department of Computer Engineering 
Bogazici University, Istanbul, Turkey 
caner.turkmenSboun.edu.tr 

August 31, 2015 


Abstract 

Nonnegative Matrix Factorization (NMF) was first introduced as a 
low-rank matrix approximation technique, and has enjoyed a wide area 
of applications. Although NMF does not seem related to the cluster¬ 
ing problem at first, it was shown that they are closely linked. In this 
report, we provide a gentle introduction to clustering and NMF before 
reviewing the theoretical relationship between them. We then explore 
several NMF variants, namely Sparse NMF, Projective NMF, Nonneg¬ 
ative Spectral Clustering and Cluster-NMF, along with their clustering 
interpretations. 


1 Introduction 


Clustering, the problem of partitioning observations with high intra-group sim¬ 
ilarity, has always been one of the central themes in unsupervised learning. 
Although a wide variety of algorithms for clustering applications have been 
studied in literature, the subject still remains an active avenue for research. 

In fact, it is possible to formulate clustering as a matrix decomposition 
problem. This formulation leads to interesting interpretations as well as novel 
algorithms that beneht from favorable computational properties of numerical 
linear algebra. 


Popularized by Lee and Seung 11 , Non-negative Matrix Factorization (NMF) 


has turned into one of the primary tools for decomposing data sets into low-rank 
factorizing matrices in order to yield a parts-based representation. It has been 
shown not long after, that NMF and variants are well-performing alternatives to 
well-known clustering algorithms and that close theoretical links exist between 
the two problems. Drawing on this link, researchers have offered a variety of 
perspectives and different methods in applying NMF for clustering. 

This report aims to make a gentle introduction to how the clustering prob¬ 
lem can be interpreted in a matrix factorization setting. A variety of NMF 
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formulations will be presented, along with the intuition about their clustering 
interpretations. The algorithmic solutions and implementation details of these 
techniques will be left out with pointers to the relevant literature. 

This report is structured as follows. We introduce the clustering problem 
and several solution strategies in Section We then introduce NMF and es¬ 
tablish the link between /c-means and NMF, in SectionWe then work on this 
link to introduce several NMF formulations geared towards solving the problem 
in Section Namely, we cover Sparse NMF, Projective NMF, Nonnegative 
Spectral Clustering and Cluster-NMF. We conclude the report in Section 

I. 1 A Word on Notation 

Capital letters of the Latin and Greek alphabets denote matrices (e.g. A, F, S), 
while lowercase bold typeface letters denote vectors (e.g. x). Vectors that 
correspond to a matrix are column vectors of that matrix, indexed accordingly. 
That is, xi is the first column of matrix X. Xij will be used to denote the Tth 
row, j-th column element of matrix X. Script capital letters denote sets (e.g. 
C). 

Throughout the document, ||.|| will be used for the L 2 norm of a vector, 

II. lli? for the Frobenius norm of a matrix. tr{.) stands for the trace operator, i.e. 
the sum of the elements along the diagonal of a square matrix. diag{.) is used 
in a similar sense to popular linear algebra software, and will denote a diagonal 
matrix constructed on the vector or list of scalars provided. 

Finally, vector dimensions will also be reused. In the data clustering prob¬ 
lem, n will be introduced as the number of observations, m as the number of 
features (i.e. the dimension of each observation), and k as the number of clusters 
or corresponding low-rank interpretation in NMF. 


2 Clustering 

In this section, we provide a gentle introduction to the clustering problem, and 
introduce the notation we will use when discussing NMF applications. 

Given a set of data points x^ € T for 1 < i < n, clustering algorithms aim 
to find partitionings of the set such that the similarity within each partition 
is maximized and similarity between different partitions is minimized. This 
general problem definition can be attacked with various techniques to generate 
a variety of different partitionings on the same set. 

This simple idea has a wide array of application areas including management 
science, business intelligence, pattern recognition, web search, biostatistics. For 
example, applied to a database of customers clustering algorithms yield groups 
of similar customers which may better respond to targeted marketing campaigns. 
In musical information retrieval, clustering can be used to group together songs 
for similar audiences. 

Clustering, or cluster analysis, is an unsupervised learning problem. That 
is, clustering techniques address unlabeled data, and try to work with different 
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objective functions that do not involve a ground truth. 

Although it is difficult to provide a definitive taxonomy of clustering al¬ 
gorithms, it is useful to provide some definitions on how they are often cat¬ 
egorized. The two primary families of clustering algorithms are partitioning 
methods and hierarchical methods. Partitioning methods partition the data 
into k groups, and iteratively relocate the observations until some dissimilarity 
between groups and similarity within groups is achieved. This is in contrast 
to hierarchical clustering, where the algorithm decomposes the data set hier¬ 
archically into partitions. In the agglomerative approach, single data points 
are merged into groups one by one until a termination condition holds. Start¬ 
ing from observations and building partitions by merging small groups, these 
methods are also called bottom-up. A contrast is divisive or top-down methods 
which start with all nodes in the same cluster and partition clusters until some 
termination condition occurs. 

Another divide between clustering methods we will find useful is hard clus¬ 
tering versus soft clustering. Hard clustering methods adopt exclusive cluster 
assignment, in that at any step of the algorithm one observation can belong to 
one and at most one cluster. Soft clustering is a relaxation on this constraint, 
and each observation can belong to different clusters in proportion to a set of 
weights. 

In the following sections, we will introduce the most popular partitioning 
problem, fc-means and Kernel Ic-means, a variant. This will be useful as we will 
reference these techniques numerous times when discussing NMF applications 
in clustering. 


2.1 /c-means 

fc-means is perhaps the most popular clustering method, and the first one that 
comes to mind. Suppose a data set X contains n observations denoted by 
Xi € M™. fc-means aims to find a partitioning of this data set Ci,C 2 , ■■■,Ck such 
that Ci C X and CiHCj = 0 for all I < i,j < k. The technique uses an objective 
function given below to minimize dissimilarity within k disjoint subsets Ci'. 


min > > l|x — 

CuC2,-,Ck 

1—1 xGCi 




( 1 ) 


where pn is the centroid of data points assigned to cluster Ci. In /c-means, as 
the name implies, the centroid of cluster Ci is defined as the mean of all points 
assigned to it. In this regard, fc-means tries to minimize within cluster variation, 
or simply the sum of squared error within each cluster. 

Algorithms for solving fc-means are some of the earliest machine learning 
algorithms that have survived to date. A high-level description of the naive 
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algorithm can be found in Algorithmj^ Early derivations can be found in 
Data: k: number of clusters 
X : set of data points 
Result: Set of k clusters {Ci,C 2 ,C^} 
arbitrarily choose k objects as initial cluster centroids; 
while cluster centroids change do 

assign each point x G X to cluster Ci with the closest (by Euclidean 
distance) cluster centroid Hi ; 

recalculate /i^ as the mean of data points assigned to Ci, for all 
I < i < k ; 

end 

Algorithm 1: A:-means algorithm 

Expressing A:-means in vectorized form will help greatly in kernelizing a 
solution as well as expressing it in matrix factorization form. 

Let us define cluster membership matrix B G {0,1}”^^, such that the matrix 
element Bij = 1 if observation i is assigned to cluster j, and 0 otherwise. We 
will refer to this matrix numerous times throughout this document, so it makes 
sense to explore its properties. First, observe that each row of B can have only 
one element taking the value 1, and all other elements must be equal to 0. This 
is due to the definition of hard clustering. As such, the columns of matrix B 
are orthogonal. 

Consequently we also have ^ij = \^j\- We now introduce the matrix 
D = diag{l/\Ci\, I/IC 2 I,..., l/|Cfe|) 

Observe that we could have equivalently written \Ci\ = ||bi|p for all i. 

Let us calculate oi, defined as Note that calculating BD^, we 

have effectively normalized the columns of B as well, so they are orthonormal. 
Equivalently, {BD^)"’'BD^ =D^B^BD^ = I. 

Finally, we arrange our observations as the columns of a data matrix X such 
that 



A = [xi,X2, ...,X„] 

Briefly revisit 0 . Note that, due to the definition of fc-means, we define the 
cluster centroids as 


E ^ 

' xeCi 

We are now equipped with all the tools we need to express A:-means in 
vectorized form. Observe that the matrix product XBD yields a matrix which 
has cluster centroid vectors fii arranged as its columns. We are looking to 
express the Euclidean distance of each observation from the cluster center it 
is assigned to. We can simply do this by calculating XBDB^. We have now 
effectively cast the cluster centroids to and arranged a matrix where each 

column i corresponds to the cluster centroid that the data item x^ is assigned 
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to. We can now rewrite the clustering objective Q in the minimization problem 
below: 


mm\\X - XBDB^Wl (2) 

B 

This is the same sum of squared error given in the original function. Also 
observe that the objective function is only dependent on B, since D is only 
defined as the inverse squared column norms of B arranged on the diagonal of 
a matrix. 

Advancing this notation further, we know that we could write ||A|||. = 
tr{A^A) for any matrix A. Using this property, we write the same problem in 
the form: 


mintr{{X - XBDB^f{X - XBDB'^)) 

B 

As shown in [5p4] , we can show that solving this form is equivalent to solving 
nun a:) - tr{X'^XBDB^) 

This derivation is more involved, and is presented in Appendixj^ as Lemma 

Observe that the first term in the new objective function is a constant, and 
the problem can safely be reexpressed as: 

ina,xtr{X"'"XBDB"’") (3) 

We will see in later sections that this set of derivations lend themselves 
to extensions for matrix factorization forms as well as spectral relaxation and 
application of the kernel trick. We can now move on to introduce, by means of 
this notation, how the fc-means problem can be extended with kernel functions. 


2.2 Kernel fc-means 


At its simplest form, fc-means clustering is only capable of calculating spherical 
clusters. An extension to fc-means for addressing more complex, non-linearly 
bounded clusters is by means of the kernel trick, introduced in this section. 
After a brief introduction to the kernel trick, we will move to define kernel 
fc-means. 

In many machine learning tasks, a given set of data points x S A are trans¬ 
formed into feature vectors via a feature map, cj) : X ^ X where X is the feature 
space 19 . However, for describing highly non-linear interactions and patterns 


in data, the feature map may end up producing feature vectors of much higher 
dimension. To avoid the associated computational cost, one may construct a 
function AT : A x A —)■ K such that 


K{x,y) = (j){y) 

Having constructed this function, often via the use of domain knowledge, it 
will now take only Oln?) time in terms of inner products to compute a Gram 
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matrix, where n is the number of observations. In computing this matrix, the 
data items have been implicitly transformed to a feature space of possibly infinite 
dimension. 

More intuitively, kernel functions often have implications as measures of 
similarity between vectors x and y. Machine learning methods which rely on the 
kernel trick to perform highly non-linear separations are called kernel methods. 

When clusters are spherical, dense and linearly separable, the naive fc-means 
algorithrrj^ can be expected to give a fair representation of clusters. However, 
if the clusters are of arbitrary shape, and are only non-linearly separable, then 
a different approach is required. It will be shown that the /c-means algorithm 
can be kernelized in order to achieve that effect. 



Figure 1: Linear Separability of Data 

For motivation, take Figure It is clear that while the second blobs data 
set is linearly separable into meaningful clusters, a non-linear decision boundary 
is required to partition the first data set into the two circles. 

It is for this purpose we extend the standard fc-means formulation to kernel¬ 
ized form, as given in |^[^. Let us introduce the matrix 

$ = [0(xi),(^(x2),...,(?i(x„)] 


Intuitively, $ is a matrix that has the observations x S df, transformed via 
the feature map (f) to arbitrary dimension in its columns. 

Recall the line of derivations that led to ([^. Without applying the kernel 
trick, we would have used the matrix <I> instead of X in each step of the process, 
having explicitly transformed its features. We would have then ended up with 
the trace maximization problem: 

maxtr($^<I>HIIiI^) (4) 

^We refer here to the algorithm given in Algorithm]^ Note that fc-means is a problem, not 
an algorithm. An appropriate name for it could be the Lloyd algorithm |14| 
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However, note that by definition, is the kernel matrix, such that each 
element = (j){xi)'^(p^Xj) = K{xi,Xj). Given this fact, we never need to 

explicitly compute the matrix $. We could just compute the kernel matrix 
with the kernel function, and safely plug it in place of the linear inner product 
matrix X'^X. 

The kernel trick has a profound implication. Without the need for ever 
computing features in the feature space J-, the kernel matrix has the power of 
implicitly mapping data to arbitrary dimension. However, note that calculating 
the kernel matrix and performing matrix operations on it requires additional 
computational cost, as the kernel matrix has elements whereas the data 
matrix had mn elements. Especially for cases where the number of observations 
n is very large, kernel methods pose a disadvantage as they become increasingly 
difficult to compute. 


3 Non-Negative Matrix Factorization 


In machine learning, approximating a matrix by two factorizing low-rank ma¬ 
trices has many demonstrated benefits. Among these benefits is discovering 
structure in data, as well as reducing dimensionality and making way for better 
generalization. 

Perhaps one of the most popular methods geared towards low-rank approx¬ 
imation is Principal Component Analysis (PCA), a close relative of Singular 
Value Decomposition (SVD). At a high level, PCA works to identify the next 
best vector (or component) through the data that accounts for the highest vari- 
Once this is done, it then repeats the process for the residual variance. 


ance 
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As a direct and analytical result of applied linear algebra, PCA helps identify 
structure and reduce dimensionality. 


Popularized in 11 , Non-negative Matrix Factorization (NMF) is a low-rank 
approximation technique which introduces the constraint that the data matrix 
and the factorizing matrices are nonnegative. By allowing only additive linear 
combinations of components with nonnegative coefficients (i.e. a conical com¬ 
bination), NMF inherently leads to a parts-based representation. Compared 
to PCA, which is a holistic representation model, NMF leads to a much more 
intuitive and interpretable representation. 

Formally, NMF is characterized by the following factorization: 


A « WH (5) 

where X,W,H >0, X & W € e and k is the num¬ 

ber of components (the rank) in which the data matrix will be represented. 
Naturally, it makes sense that k < min(m, n). 

The approximation problem presented in is often formulated as an opti¬ 
mization problem of the form 


min D{X\\WH) 
W,H>0 
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where D{A\\B) is a divergence function. The most popular choice for the di¬ 
vergence is the Euclidean distance between X and WH, in which case the 
optimization problem becomes: 


^in^\\X-WH\\l = Y.i^-WH)l 


( 6 ) 




Several strategies and algorithms have been presented in literature towards 
the solution of this problem, including Constrained Alternating Least Squares in 
the original Paatero and Tapper presentation |l6p multiplicative update rules 
presented in [^, projected gradient methods Iffl- 

We dedicate the next few paragraphs to building intuition about the impli¬ 
cations of NMF. Assume each column of X is an image of m pixels. Then, using 
the approximation X k WH, one may say that the columns of W correspond to 
a basis image. Each column of H on the other hand, corresponds to an encoding 
of each image in terms of the basis images. 

Nonnegativity constraints on X, W, and H implicitly make for a more in¬ 
terpretable and parts-based representation of the data. W,H >Q implies that 
each element Xij = WihJ, is a weighted addition of positive vectors. By lim¬ 
iting both the basis and the encoding to nonnegative values, NMF forces the 
factorization to an additive weighted structure (encoding), of nonnegative build¬ 
ing blocks (bases). We will see in Section that this is a useful constraint in 
clustering. 

In contrast, PCA finds a basis composed of the eigenvectors of the covariance 
matrix -XX"^ corresponding to the k greatest singular values, guaranteeing 


the best representation in lower-rank 18 . However, the eigenvectors of the 
covariance matrix are not necessarily nonnegative. Intuitively, PCA bases do 
not represent an additive linear combination of the features, and the weight in 
one principal component can be canceled out by another. This prevents PCA 
bases to be readily interpreted as parts-of-whole representations. 

In turn, the nonnegativity in NMF implies that its use is constrained to 
cases where the data matrix is composed of nonnegative elements. In many real 
world scenarios, however, the nonnegativity of data points is inherent, such as 
pixel intensities in image processing, signal intensities, chemical concentrations, 
etc. 


4 NMF Applications to Clustering 

4.1 Intuition 

Having mentioned that NMF can be used to “discover structure in data”, we 
already have the intuition that it must have interpretations in a clustering set¬ 
ting. 

^in which the problem was called “Positive Matrix Factorization” 







We had introduced the /c-means problem earlier in Section 2.1 and expressed 
it in vectorized form. We will now elaborate that this form is easily interpretable 
in a matrix factorization framework. First, recall ([^: 


min \\X-XBDB'^\\j, 

Note that we have made the constraint that B has orthogonal columns ex¬ 
plicit. Let’s elaborate on this optimization problem. Evidently, we are trying to 
approximate a data matrix, X « XBDB^. If the rank of B is unconstrained, 
a trivial solution would be to take B — I, in which case D = I would hold by 
definition. This would be equivalent to assigning each observation to its own 
cluster. However, clearly we would like to summarize the data in fc < min(m, n) 
clusters. 

Compare the approximation problem X « XBDB^ to ([^. Assume our 
data matrix is A > 0. Then, by definition, the matrix of cluster centroids 
XBD > 0. Then we could compare XBD to the basis or W in the NMF 
formulation, and to H, the encodings. If such a representation is achievable, 
we have a solid way of learning clusters via favorable computational properties 
of numerical linear algebra. 

However, a critical constraint in the clustering problem prevents us from 
jumping to this conclusion. We defined each row of H as 0 except one element, 
which is 1. From here we found that B^BD^ = J. NMF, on the other 
hand, does not constrain to have orthogonal columns which would give us 
the needed clustering interpretability. 

Let us dive deeper into this problem. What does having orthogonal 
columns imply? We have argued that row-wise sparsity induces orthogonality 
of columns. This leads to the intuition that relaxing this orthogonality leads to 
denser rows. Remember each row of B is interpreted as the cluster assignment 
vector for some observation. A dense vector implies that the observation is not 
characterized as belonging to one cluster, but expressed as a weighting of several 
cluster centroids. If the original goal is clustering interpretability however, this 
is not a desirable outcome. 

Then, can we solve NMF while constraining the right factorizing matrix to 
have the same structure as It turns out this is also an intractable problem, 
and the orthogonality constraint must be relaxed in order to make the problem 
solvable in polynomial time. 

We are then left with the following intuition. If we can find ways to con¬ 
strain factorizing matrices to near-orthogonality, NMF provides the ground for 
clustering interpretations. Then simply selecting the maximum element’s index 
from the rows of B, we can directly use NMF methods for producing cluster 
assignments. 


4.2 Kernel /c-Means and Symmetric NMF Equivalence 

Ding et al. in have shown a stronger relationship between symmetric NMF 
and spectral /c-means clustering. They further demonstrate that solving the op- 
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timization problem presented in Q with additional constraints one can achieve 
symmetric NMF formulation, while the orthogonality of the factorizing matrices 
are preserved. In this section, we follow their footsteps in demonstrating the 
same equivalence. 

Before moving forward, note that symmetric NMF is simply the factorization 
a symmetric nonnegative matrix A to factors A « GG^. 

To pave the way, we work with our usual notation introduced in ([^: 


min \ \X - XBDB^ 

B 


|2 

If 


We had further demonstrated, with the help of Lemma that this was 
equivalent to solving 


mayitr{{BD^Y X^ XBD^) s.t. {BD^y BD^ =I 


(7) 


BD2 


In fact, this completes the expression of the fc-means problem introduced in 
Section |4.1 in spectral relaxation notation similar to . We had shown in 

that the inner product kernel matrix can safely be replaced with 


2.2 


Section 

other kernel matrices 

From this point on, we introduce a slight change in notation. We will call this 
kernel matrix A = We will absorb into B, and introduce G = BD^. 

Note that the definition of B and D immediately lead to, G^G = I. 

We can now demonstrate that solving symmetric NMF on the kernel matrix, 
such that A « GG^ is equivalent to Kernel fc-means. 

Theorem 1. (Ding, He and Simon) NMF of A = GG^ is equivalent to Kernel 
k-means with strict orthogonality constraint relaxed. 

Proof. Observe that equivalent to maximizing Q, one could minimize: 


G* = argmin —2tr{G'^ AG) 

G'^G=I,G>0 

= argmin -2tr{G^ AG) + \\G'^G\\% 

GTG=I,G>0 

= argmin \\A\\p - 2tr{G'^AG) + \\G'^G\\p 

G'^G=I,G>0 

= argmin P-GG^||| 

GTG=I,G>0 

Relaxing G^G = I completes the proof. 


□ 


As such, fc-means with orthogonality relaxed (i.e. soft fc-means) and symmet¬ 
ric NMF of the pairwise inner product matrix X'^X are theoretically equivalent. 
By use of kernel matrices, this is easily generalized to kernel fc-means. 

However, relaxing the orthogonality constraint G^G = I may result in the 
density issue discussed in Section |4.1| To that end, we must establish that 
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near-orthogonality or row-wise sparsity of G is enforced. We present another 
theorem in shown in to demonstrate symmetric NMF preserves the necessary 
near-orthogonality. 

Theorem 2. (Ding, He and Simon) Optimizing mine \ \A — GG'^\W retains the 
near-orthogonality G^G = I 

Proof. First observe that the solution G* = argmin^. \\A—GG'^\W is not unique. 
That is, multiple alternatives of G* are available for the case A « GG^. The 
next few equations will imply that among competing alternatives for G, solving 
for the above problem will favor near-orthogonal G. 

We had already introduced the cost function: 

J = Pill, - 2tr(GPG) + IIG^Gp 

Assuming A « GG^ we can write: 

J « PP - 2tr(GGP) + IIG^Gp 
J « ||Ap - 2trpP) + IIG^GIII, 

J « —pp -I- IIG’^Gp 


Then, the resulting problem can alternatively be written as: 

™*^G>0,AwGG'r I P^Gp 

Note that: 


IP^GIII. = ^(G^G)I^. = 5](gfg,P + ^ IlgiP (10) 

ij i^j i 

Minimizing the first term in the above addition leads to the desired near¬ 
orthogonality of columns, i.e. gfgj « 0 for all 1 < i < j < fc. The second term 
however, cannot be 0 as A « GG^. Note that: 


^ p « Y,{GG^)is = Y. Gl^G^s = 5] |g*p (11) 

Is Is lis i 

where |.| denotes the Li norm. This implies |gi| > 0 and consequently 
11 gill > 0. We are then left to conclude that minimizing J preserves near¬ 
orthogonality in the columns of G. □ 


As discussed in Section 4.1 this near-orthogonality (i.e. row-wise sparsity) 
is essential in a clustering interpretation. Shown above, expressing a somewhat 
relaxed fc-means as symmetric NMF preserves this requirement, thus the two 
techniques are theoretically equivalent. Hence we have established that fc-means 
and NMF have a strong theoretical link. We can now move to introduce various 
NMF methods that have been shown to perform well in clustering applications. 
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5 NMF Methods for Clustering 

In the last section, we demonstrated a high-level link between fc-means and non¬ 
negative matrix factorization. We then extended this argument to demonstrate 
that relaxing the hard clustering constraint on fc-means, we are able to show a 
theoretical equivalence between the method and symmetric NMF on the kernel 
matrix. 

Now, drawing on this theoretical link, we will introduce several NMF formu¬ 
lations and corresponding algorithms; and how their results have a clustering 
interpretation. 


5.1 Sparse NMF 


As discussed above, the matrix H can be interpreted as posterior cluster mem¬ 
bership probabilities, and sparsity on its individual rows must be imposed to 
achieve interpretability in clustering with NMF. 

The natural idea that follows this line of argument is adding sparsity con¬ 
straints or regularization to the cost-function optimized during NMF. One such 
formulation of the cost function is introduced in 
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mm - 
w,H 2 


\\A-WH^ 


\l 


■v\\w\\% + . 






s.t.W,H > 0 


( 12 ) 


where \H(j ,:) | refers to the Li norm of the j-th row of H. This regularization 
could be applied in many forms, including putting linin', or ||hi|| instead of 
the \H(j, :)p term. However, we seek row-wise sparsity in the right-factorizing 
matrix, and this formulation will fulfill that goal. 

Before moving to introduce Kim and Park’s solution strategy for the SNMF 
problem introduced above, let us introduce the standard NMF solution via 
Alternating Non-Negativity Constrained Least Squares (ANLS). Given the ob¬ 
jective function 

mmllX -WHill 

W,H 

iteratively solving the following two nonnegative least squares problems con¬ 
verges to a stationary point of the objective function: 


min \ 

W>0 

(13a) 

min \\WH - Alii 

H>0 

(13b) 


Note that the original cost function of NMF is non-convex, and non-convex 
optimization algorithms in this regard only guarantee stationarity of limit points, 
as is the case with NMF/ANLS. In other words, the algorithm is prone to con¬ 
verging to local minima, rather than the global minimum. 
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The authors of solve the above subproblems via the active set method 
introduced in detail in [^. In this paper, we will not focus on the internals 
of the minimization algorithms with respect to nonnegativity constraints, but 
rather the solution strategies with added regularization for sparsity. 

Take (12). It’s significance with respect to the clustering application is 
along with superior results presented. The cost function is 


discussed in 10 


minimized under the following two ANLS update rules: 


min 

H>0 


min 

w>o 



(14a) 

(14b) 


where G is a vector of all ones, Ofc G is a vector of all zeros, 

Ofexm G is a matrix of all zeros and Ik is the k x k identity matrix. 

These two subproblems are directly linked to the original subproblems in 
We are only embedding the regularization terms in the factorizing ma¬ 
trices. The row Sk ensures that additional cost is incurred on the Li norm of 
rows of H. Similarly, Ik stacked under H simply ensures = ||W|||. is 

minimized. 

Compared to spectral methods that will be introduced in later sections, the 
main advantage of SNMF is that there is no need to calculate the affinity matrix 
X"’"X, which may be computationally costly. Furthermore, the regularization 
terms introduce extra control parameters 77 , /3 that implementers can use to 
trade-off accuracy versus interpretability. Other applications of SNMF have 
been discussed in [^|^. 


5.2 Projective NMF 


Projective NMF (PNMF) was introduced by Yuan and Oja in 22 , for deriving 


basis vectors more suited towards learning localized representations. In a later 
publication [^, Yuan et al. demonstrate that PNMF is more closely related 
to fc-means clustering than ordinary NMF equivalence discussed in Section [4T| 
In this section, we introduce the PNMF formulation, discuss its equivalence 
to fc-Means clustering, and provide update rules for solving the approximation 
problem under Euclidean distance. 

Given a data matrix X, we are looking for a projection such that the columns 
of X are projected to some subspace of K™, and are still accurate representations 
of the original matrix. In other words, we are looking for a matrix P G 
such that X « PX. 

Now note that for any symmetric projection matrix of rank k there exists 
P = GG^, such that G^G = I and G G (see Lemma [^. We could then 

equivalently write 

X « GG^X 
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Let us go through the implications of this approximation one by one. X 
effectively defines the data items in the dimension R’". Writing GG^X, we 
simply cast this back to In solving for X « GG^X we’re simply looking 

to reduce the data matrix rank to fc, and still find a good representation of the 
original matrix. Another interpretation is that we hnd prototypical features 
from X that best represent the data in lower rank. 

We already know one good G that is capable of doing this. In fact, we can 
ensure that we can get the best representation possible in lower rank by setting 
G = U, where U € is left singular vectors of X with the k greatest 

singular values. This is a natural result of one of the staples of applied linear 
algebra: Singular Value Decomposition. 

However, singular vectors of X are not necessarily nonnegative. U"^X does 
not represent features built from additive combinations of the features of the 
original data matrix X. This lacks the parts based representation leading to 
interpretability we required in Section 

Alternatively, we could cast the problem X « GG^X in an NMF framework 
to find a basis that is both sparse and parts based. We can now introduce 
Projective NMF (PNMF). Rewritten as an optimization problem, minimizing 
Euclidean distance as a divergence, PNMF algorithms solve: 

minG>o\\X -GG^X\\% (15) 

Let us connect this argument to clustering. Above, we argued that PNMF 
finds prototypical features (or rather groups of features) that best represent 
data in lower rank. In clustering, our goal is similar, we aim to hnd prototypical 
cluster centroids, or groups of observations that represent the original data the 
best. One simple trick is now enough to connect PNMF to clustering: perform 
PNMF on the transposed data matrix A^! 

We can now review the projection argument. We seek to solve « PA^, 
where now P S R"^". We argued that any symmetric projection matrix can be 
written as P = GG^ such that G^G — I. The choice of notation is not at all 
coincidental. Using notation introduced earlier, we can now simply write: 


A^s 

s GG'^A'^ 

(16a) 

A ? 

s AGG'^ 

(16b) 

A f 

s xbd^d^b'^ 

(16c) 

A f 

s XBDB'^ 

(16d) 


recovering the clustering problem introduced in ([^. 

A variety of algorithms for solving PNMF with respect to different diver¬ 
gences have been presented in literature. Here we present only the multiplicative 
update algorithm for minimizing Euclidean distance. The proof of convergence 
regarding this algorithm, and variations can be found in 20 -23 25 . Given 


min||A-GG^A||| 

G 
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the cost function is non-increasing, and keeps C? > 0 under the update rule: 


C : c 

iGGTXXTG),j{XXTGG^G),j ^ ’ 

One of the main benefits arriving with PNMF is reducing the number of 
free parameters to be learned. In the classical NMF formulation, the number of 
free parameters is k(rn + n), which makes for ambiguity. Furthermore, it is not 
unusual that k(m -I- n) > mn, i.e. the number of free parameters is larger than 
the data matrix itself. However, with the above formulation, PNMF ensures 
that the number of parameters to be learned is km < mn. 

Finally, learning only one parameter matrix G means that any incoming new 
new data vector x can be predicted, or in our context, placed into a cluster, by 
calculating GG^x.^. This is in contrast to classical NMF, where both W and H 
are needed. 


5.3 Non-Negative Spectral Clustering 

Nonnegative Spectral Clustering (NSC) is an implementation of NMF on the 
spectral clustering problem, proposed by Ding et al. in [^. 

NSC is a graph theoretical approach which follows naturally from the equiv¬ 
alence presented in Section 4.2 or by the same authors in [^. While the original 
paper showing the equivalence does not detail an algorithmic solution, a more 
involved formulation and solution strategy is presented in [^. 

Before moving to detail NSC, let us try to interpret our problem in graph 
theoretical terms. We had introduced the affinity matrix A. We could safely 
think of this matrix as a weighted adjacency matrix of graph vertices all corre¬ 
sponding to a data observation. We know that H is a symmetric matrix, so it 
makes sense that our graph is undirected. 

Remember /c-means is equivalent to solving the trace maximization problem 
given in([^: 


max tr(X"’" XBDB"’") 
B 


maxtr{X'^XBDB"’") (18a) 

= maxtr{ABDB^) (18b) 


Let us walk through the matrix product in (18). Calculating AB would 
yield a n x k matrix, with each entry corresponding to the sum of affinity 
measures from a certain vertex to all vertices in a certain partition, or groups 
of vertices. Concretely, if we define the affinity between and Xj as d{xi,Xj), 
then {AB)ij = d{xi,x^). Computing ABD simply normalizes this to 

the average affinity: {ABD)ij = Finally, the diagonal 

elements of ABDB^ , for each observation Xi, correspond to the average affinity 
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from Xi, to all the observations in the same cluster as itself. Then {ABDB^)n = 


\Cp\ ^ 


Xm GCj 


c?(xi,x^) such that x^ G Cp. 


Intuitively, solving (18) one tries to assign clusters such that the average 
affinity to co-clustered observations is maximized. In somewhat lighter terms, 
we aim to maximize intra-cluster affinity. Finally, before moving on to the graph 
cut problem, let us recap our notation above: 


maxtr{ABDB^) (19a) 

k ^ 

= max^— d(x„x^) (19b) 

p—1 ^ XijX^GCp 

k ^ 

= max^— Y (19c) 

p=l x.i,x.fj,£Cp 

Let us now introduce the classical graph cut problem. Given a graph Q(V,£), 
a minimum cut aims to find disjoint subsets of vertices, or partitions, Ci,C 2 , ■■■,Ck 
that minimizes: 


Jcut — Y, X/ X/ 

l<p<g<fc XiGCp XjGCq 

This version of the problem is also known as minimum k-cuts. However, 
solely solving for this objective function often yields insufficient results in prac¬ 
tice. Observe that, this objective function would be minimized by simply taking 
the least connected vertices as partitions. 


Normalized cuts 17 is an alternative problem that aims to cut out larger 


partitions, characterized by the objective function given below. 


J n. 


= E 


E 


XiGC„ Ex„GC„ 


ExiGCp E 


4 ■ 

x„GC„ 


l<p<g'<A; 


Ex, gC„ E 


x,GV 


Ex„gC„ E 


x, GV 


( 20 ) 


Note the similarity between (20) and (19). From here, we have the intuition 


that this cost function must be easily vectorized. Let bp G {0,1}” be an indi¬ 
cator for cluster Cp, a column of the earlier cluster assignment matrix B. Let 


5i = ^ = diag{Si,S 2 , ■■■,Sn) G M"^”. The cost function in (|20l 

can be expressed as: 


, ^bf(A-H)b, 

2^ b;f’Ab, 

1=1 I ‘ 

■ h bFAb, 
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Define as ^Aij , and 

r=(bi/||Al/2bi||,b2/||Al/2b2||,...,bfc/||Al/2bfc||) 
The problem becomes: 


max tr{r^ ^T) 
r^Ar^/.r^o 


(22) 


Observe that simply taking A = I would have made the problem equivalent 


to the one introduced in Section 


4.2 


since then T = BD^ — G, and the con¬ 


straint r^Ar would become G^G = I. Further observe that the existence of A 
in the equation is a prerequisite for the normalization, as in normalized cut. 

Authors of present a multiplicative update rule, along with convergence 
and correctness proofs for solving (22) numerically: 




(^r). 


{DTc 


(a = r'^AF) 


The main advantage arriving with expressing the clustering problem in a 
graph theoretical approach is the ability to use a variety of affinity matrices, or 
kernel functions. As we presented previously, ordinary fc-means can be thought 
of as a graph cut problem on the linear inner-product kernel matrix X^X. 
However with spectral clustering, the kernel can be extended to more powerful 
varieties. 


5.4 Cluster-NMF 

A key shortcoming of clustering by NMF is that the data matrix, and conse¬ 
quently the cluster centroids are constrained to be nonnegative. By relaxing this 
nonnegativity constraint in [^, Ding et al. strengthen the equivalence between 
fc-means and NMF, and generalize the application of the technique to mixed 
sign data matrices. 

Consider our first presentation of /c-means in matrix factorization form, in 
mins ||A — XBDB"’"\\f,. Here, since the data matrix nonnegative, the 
cluster centroid matrix XBD is also constrained. The authors of first re¬ 
lax the nonnegativity constraint on ordinary matrix factorization, introducing 
Semi-NMF. Then, similar to a line of argument presented in Section |2.1[ they 
constrain the left factorizing matrix to convex combinations of the data matrix, 
introducing Convex-NMF: 


min llAij- 
1>F>0,B>0 


X±FB^ 


Here, X± is used to denote that the data matrix now has mixed signs. Note 
that XF is now a matrix of convex combinations of the columns of X. This 
constraint better captures the notion of centroids, if we start interpreting the 
XF as a cluster centroid matrix and B as cluster assignments. Finally, noting 
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that the extra degree of freedom on the matrix F is not necessary, Cluster- 
NMF is given, in exact formulation introduced in earlier sections, but with 
nonnegativity of data matrix relaxed: 


minllXi -X±BDB^\\f 

B>0 

= min ||X± — X±GG^\\f 
G>o 


(23a) 

(23b) 


Note also that without the nonnegativity relaxation, this is exactly equal 
to the PNMF formulation introduced earlier, in that — GG'^X'^W'^p = 

\\X — XGG’^W'jp. The algorithm for optimizing the objective function is also 
given in |^, and will not be repeated here. 


5.5 Theoretical Similarities and Contrasts 

Until this point, we introduced how fc-means can be interpreted in vectorized 
form, and a matrix factorization framework. After introducing the basic no¬ 
tation, we demonstrated that a variety of NMF methods applied to clustering 
adhere to this theoretical foundation. Here, we must briefly contrast these 
methods to understand what underlies their differences in application and per¬ 
formance. 

The divide between SNMF and other methods should be apparent to the 
reader. SNMF simply introduces a regularization embedded into the NMF prob¬ 
lem. The method entails solving two least squares problems at each iteration. 
It further introduces two regularization parameters that the implementer can 
use to trade off interpretability and accuracy. 

The other techniques, PNMF, NSC and Cluster-NMF introduce one key fea¬ 
ture. In the computation of these factoriz ation s, calculating the Gram matrix 
X'^X is necessary. As argued in Section 


2.2 


this means the algorithms are 
readily kernelized and can be extended to work on A = <I>^<I>. As key con¬ 
trasts, NSC introduces the extra normalized cut constraint, and Cluster-NMF 
generalizes the techniques to mixed-sign data matrices. 

Another set of differences of these techniques, which this report does not 
cover is their extensions to more complex, asymmetric divergence (e.g. Kullback- 
Leibler, Itakura-Saito etc.). 

Finally, the authors of above techniques present different multiplicative up¬ 
date rules and algorithms that serve the unique constraints. These variations 
in implementation have been shown to yield diverse results in real-world appli¬ 
cations. 


6 Conclusion 

In this work, we aimed to present NMF, and its applications to the clustering 
problem. 
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Potential applications of NMF in clustering were discussed with the early 
research of the technique. While initial findings with NMF yielded parts-based 
representations and sparser encodings, dense and holistic representations were 
also demonstrated. This led to the pursuit of factorizing for sparser representa¬ 
tions. 

As presented above, sparser representations have a direct link to inter- 
pretability in clustering. When one of the factorizing matrices is to be used 
for cluster assignment, interpreting that matrix requires row-wise sparsity, or 
near-orthogonality. 

Before moving to discuss sparser techniques, we established that a direct link 
between soft fc-means and NMF could be drawn. In later sections, we extended 
this argument to stronger relationships shown between spectral clustering and 
matrix factorization equivalents. 

Finally, we covered an array of matrix factorization methods that can be 
used in clustering. While each such presentation provided unique features, we 
drew on their similarities and demonstrated that nearly all share a theoretical 
basis. 

It must be noted that the NMF problem and extensions to various appli¬ 
cation domains and mathematical consequents are well studied. Extensions 
to higher order factorizations such as Quadratic NMF, tensor decompositions, 
Bayesian inference exist, with consequences relating to the clustering problem. 
These studies were left out of scope for this report, but may well constitute a 
next step in research for the reader. 
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A Some Useful Results Prom Linear Algebra 

This section contains some useful proofs in linear algebra we often refer to in 
this document. 

Lemma 1. Let W be a matrix of orthonormal columns, such that = I. 

Then bFbF^ is a projection matrix. 

Proof. Recall that a projection matrix is defined as a square matrix P such that 
PP = P holds. Then: 


WW^WW^ = W{W^W)W^ 

= WIW^ (as W^W = / by definition) 

= bFbF^ 


(24a) 

(24b) 

(24c) 


□ 
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Lemma 2. For any symmetric projection matrix P € of rank k there 

exists G € such that P = GG^ and G^G = I. 

Proof. It is known that any real symmetric matrix has an eigendecomposition, 
written in the form 


p = QAg^ 

where Q G is a matrix with the eigenvectors of P arranged as its 

columns, and A is a diagonal matrix with corresponding eigenvalues arranged 
on its diagonal, in descending order. 

Note that the eigenvalues of a projection matrix can only be 1 or 0. To 
derive this result, take the v as an eigenvector and A as an eigenvalue of P. 
Then Pv = Av = PPv = PAv = A^v, hence A^ = A, by which we have 
Ag{0,1}. 

Finally, observe that the number of nonzero eigenvalues is equal to the rank 
of a symmetric matrix. Then the matrix A takes the form: 


where Ik is the k x k identity matrix. Then simply defining G as the first k 
columns of Q, i.e. G = (qi, q2,..., qfc) one can see P = GIG^ = GG^. Since 
the columns of Q are orthogonal and unit norm vectors, we have G^G = I. □ 

Lemma 3. mine ||^ — XBDB'^W^p = min^ tr{X'^X) — tr{X'^XBDB'^) 

Proof We start by writing the Frobenius norm as the trace: 

mintr((A: - XBDB^f'{X - XBDB'^)) 

B 

First, we introduce , where Dfj = yJThj- In the next few equations, 
we will make frequent use of some key properties. First of all, as D (and 
D^) is diagonal, = D. Secondly, the trace operator is invariant under 
cyclical permutations, i.e. tr{ABC) = tr{GAB) = tr{BGA). Finally, note 
that by definition of our clustering notation, BD^ has orthonormal columns, 
or [BD^y{BD^) = I. We can now go through the following steps in showing 
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the equivalence: 


mmtr{{X - XBDB'^ f{X - XBDB^)) 

B 

= mintr{{X^ - BDB^X'^){X - XBDB'^)) 

B 

= miTLtr{X'^X - X^XBDB'^ - BDB^X'^X + BDB^X^XBDB'^) 

B 

= min tr{X^X) - tr{X^XBDB'^) - tr{BDB'^X'^X) + tr{BDB'^X'^XBDB^) 

B 

= mintr{X^X) - 2tr{X^XBDB'^) + tr{BD^B^X^XBD^B^) 

B 

= mintr{X^X) - 2tr{X^XBDB'^) + tr{{BDi){BDi)'^X'^XBD^BD^)^) 

B 

= mintr{X^X) - 2tr{X^XBDB'^) + tr{{BD^f'X^XBD^) 

B 

= mintr{X^X) - 2tr{X^XBDB'^) + tr{X^XBD^ {BD^'^) 

B 

= min tr{X^X) - 2tr{X^XBDB'^) + tr{X^XBDB'^) 

B 

= min tr{X^X) - tr{X'^XBDB^) 

□ 
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